[server] Coordinator Server Supports High Availability #1401
Conversation
not ready for review
@zcoo if it is not ready for review, you can click on
@michaelkoepf I get it, thank you!
Ready for review now! @wuchong @swuferhong
I have a question here. Do we need to properly handle
Should we take these two concerns into consideration in the same way as Kafka?

Thanks for the reminder. I considered this. Do you have any suggestions?

IMO, we can split them into different PRs, but they should be merged together? Otherwise, we might introduce other metadata inconsistency issues when introducing HA. What's your opinion?
// Do not return, otherwise the leader will be released immediately.
while (true) {
    try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
    }
}
Why don't we use LeaderLatch here if we need to hold the leadership?
Very good suggestion! It seems LeaderLatch is more suitable in this scenario, and I will try to use it instead.
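For reference, a minimal sketch of how Curator's LeaderLatch could replace the busy-wait loop above; the latch path `/coordinators/leader` and the wrapper class are illustrative placeholders, not names taken from this PR:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;

// Illustrative sketch only; path and class name are placeholders.
public class CoordinatorLeaderHolder implements AutoCloseable {
    private final LeaderLatch latch;

    public CoordinatorLeaderHolder(CuratorFramework client, String serverId) {
        // Every coordinator server registers under the same latch path; Curator
        // manages the ephemeral znode and the ordering of candidates.
        this.latch = new LeaderLatch(client, "/coordinators/leader", serverId);
    }

    public void runAsLeader(Runnable leaderService) throws Exception {
        latch.start();
        // Blocks until this server wins the election. Leadership is held until
        // close() is called or the ZK session is lost, so no busy-wait is needed.
        latch.await();
        leaderService.run();
    }

    @Override
    public void close() throws Exception {
        latch.close(); // releases leadership and deletes this server's latch node
    }
}
```

Compared with the sleep loop, interruption handling and leadership release are left to the latch itself.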
if (tryElectCoordinatorLeaderOnce()) {
    startCoordinatorLeaderService();
} else {
    // standby
    CoordinatorLeaderElection coordinatorLeaderElection =
            new CoordinatorLeaderElection(zkClient.getCuratorClient(), serverId);
    coordinatorLeaderElection.startElectLeader(
            () -> {
                try {
                    startCoordinatorLeaderService();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
}
This logic looks strange to me. Why do we explicitly create the path for the first election, but use LeaderLatch for the standby Coordinator? Since we’re already using the framework, perhaps we could standardize on LeaderLatch?
Also, this logic seems problematic. If cs-1 becomes the leader on the first startup and later loses leadership due to a network issue, it appears it will never participate in the election again, because no LeaderLatch was registered for the node that became leader during the first startup.
Yes, I think we don't need special logic in the first-round election. I have changed it.
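One possible shape of that uniform election path, assuming a `LeaderLatchListener` drives both promotion and demotion; `startLeaderService`/`stopLeaderService` are stand-ins for the leader-only services in this PR, not actual method names from it:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;

// Illustrative sketch: all coordinator servers, including the first one started,
// go through the same LeaderLatch instead of a special first-round code path.
final class UniformCoordinatorElection {
    private final LeaderLatch latch;

    UniformCoordinatorElection(
            CuratorFramework client,
            String serverId,
            Runnable startLeaderService,
            Runnable stopLeaderService) {
        this.latch = new LeaderLatch(client, "/coordinators/leader", serverId);
        this.latch.addListener(
                new LeaderLatchListener() {
                    @Override
                    public void isLeader() {
                        startLeaderService.run(); // promoted to active coordinator
                    }

                    @Override
                    public void notLeader() {
                        // Demoted, e.g. after a ZK session loss; step down but stay
                        // registered so this server can win a later election.
                        stopLeaderService.run();
                    }
                });
    }

    void start() throws Exception {
        latch.start(); // joins the election; Curator re-registers after reconnects
    }

    void stop() throws Exception {
        latch.close();
    }
}
```

With a single latch per server, a node that loses leadership simply stays in the candidate set, which avoids the cs-1 scenario described above.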
Hi @zcoo. Overall, I have some concerns about several issues:
@LiebingYu Sure. I think we need to do these things, and I will introduce the epoch mechanism in this PR soon.
Purpose
Support high availability for the coordinator server.
Linked issue: close #188
Brief change log
Tests
See: com.alibaba.fluss.server.coordinator.CoordinatorServerElectionTest
API and Format
Documentation
Currently, this is implemented with one active (leader) coordinator server and several standby (alive) coordinator servers. When several coordinator servers start up at the same time, only one can successfully preempt the ZK node, win the election, and become the active coordinator server (leader). Once it fails over, the standby servers take over through a leader election process.
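To make the "preempt the ZK node" step concrete, a minimal sketch of that race using an ephemeral znode (the path and method name are illustrative; the PR itself delegates the election to Curator's LeaderLatch):

```java
import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;

// Illustrative sketch only; the actual election in the PR goes through LeaderLatch.
final class CoordinatorPreemption {

    static boolean tryBecomeActiveCoordinator(CuratorFramework client, String serverId)
            throws Exception {
        try {
            // Ephemeral: the node disappears when the active coordinator's ZK session
            // ends, which is what allows a standby to take over during failover.
            client.create()
                    .withMode(CreateMode.EPHEMERAL)
                    .forPath("/coordinators/active", serverId.getBytes(StandardCharsets.UTF_8));
            return true; // won the race: this server becomes the active (leader) coordinator
        } catch (KeeperException.NodeExistsException e) {
            return false; // another server preempted the node first: stay standby and watch
        }
    }
}
```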